ImpactMojo

Statistical Assumptions Checklist & Troubleshooting Guide

Practical guide for checking assumptions and handling violations in South Asian development research

Why Assumptions Matter in South Asian Development Research

General Assumption-Checking Workflow

  1. Pre-analysis: Check data quality, sample size, variable types
  2. Assumption testing: Use diagnostic tests and visualizations
  3. Violation assessment: Determine severity and impact
  4. Choose solution: Transform data, use robust methods, or acknowledge limitations
  5. Document decisions: Report assumption tests and remedial actions

CORRELATION ANALYSIS ASSUMPTIONS

Pre-Analysis Checklist

Assumption 1: Linearity

How to Check:
• Create scatterplot of X vs Y
• Look for straight-line pattern
• Check residuals plot (if using regression)
Violation Signs:
• Curved pattern in scatterplot
• U-shaped or inverted U-shape
• Relationship changes direction
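
The residual check above can be sketched numerically as well as visually. The example below is a minimal illustration using only the Python standard library; the income and health values are hypothetical, and the tercile split is one simple heuristic rather than a formal test. It fits a straight line and looks at the mean residual in the low, middle, and high thirds of X.

```python
import math
from statistics import fmean

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residual_means_by_tercile(x, y):
    """Mean residual in the low, middle, and high thirds of X."""
    intercept, slope = fit_line(x, y)
    pairs = sorted(zip(x, y))
    resid = [b - (intercept + slope * a) for a, b in pairs]
    k = len(resid) // 3
    return fmean(resid[:k]), fmean(resid[k:-k]), fmean(resid[-k:])

# Hypothetical data: a health score that rises with income but flattens out
income = [float(i) for i in range(1, 31)]
health = [10 * math.log(v) for v in income]

low, mid, high = residual_means_by_tercile(income, health)

# A straight line fitted to a concave relationship leaves negative mean
# residuals at both ends of X and a positive mean residual in the middle.
curved = low < 0 < mid and high < 0
print(f"mean residuals: low={low:.2f}, mid={mid:.2f}, high={high:.2f}")
print("Curvature suspected:", curved)
```

If the three means share no systematic pattern (all close to zero), a straight line is probably adequate; a sign pattern like the one here is the numeric counterpart of a curved scatterplot.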

Solutions for Non-linearity:

South Asian Context:

Common non-linear relationships:
• Income vs health outcomes (diminishing returns at high income)
• Education vs fertility (steep decline, then plateau)
• Distance vs service use (threshold effects)

Assumption 2: Normal Distribution

How to Check:
• Histogram of each variable
• Q-Q plots
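• Shapiro-Wilk test (commonly recommended for small samples, n < 50)
• Kolmogorov-Smirnov test (n ≥ 50; use the Lilliefors correction when the mean and SD are estimated from the data)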
Violation Signs:
• Skewed distributions
• Multiple peaks
• Extreme outliers
• p < 0.05 in normality tests
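
As a rough numeric companion to the histogram and Q-Q plot, the sketch below computes sample skewness and the correlation between the sorted data and normal quantiles (the quantity a Q-Q plot shows visually). It uses only the Python standard library; the income values are hypothetical and generated deterministically from log-normal quantiles, and the log transformation is shown as one common response to right skew, not as a universal fix.

```python
import math
from statistics import NormalDist, fmean, pstdev

def skewness(values):
    """Population skewness: about 0 if symmetric, positive for a long right tail."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

def qq_correlation(values):
    """Correlation of sorted data with normal quantiles; near 1 suggests normality."""
    n = len(values)
    ordered = sorted(values)
    theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mo, mt = fmean(ordered), fmean(theo)
    num = sum((o - mo) * (t - mt) for o, t in zip(ordered, theo))
    den = math.sqrt(sum((o - mo) ** 2 for o in ordered)
                    * sum((t - mt) ** 2 for t in theo))
    return num / den

# Hypothetical right-skewed household incomes (a few very large values)
n = 200
income = [math.exp(NormalDist(mu=9, sigma=1).inv_cdf((i + 0.5) / n))
          for i in range(n)]
log_income = [math.log(v) for v in income]

skew_raw, skew_log = skewness(income), skewness(log_income)
qq_raw, qq_log = qq_correlation(income), qq_correlation(log_income)
print(f"raw income: skewness={skew_raw:.2f}, Q-Q correlation={qq_raw:.3f}")
print(f"log income: skewness={skew_log:.2f}, Q-Q correlation={qq_log:.3f}")
```

For a formal test with a p-value, `scipy.stats.shapiro` is a widely used implementation; the statistics here are descriptive only.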

Solutions for Non-normality:

Assumption 3: Homoscedasticity (Equal Variance)

How to Check: Inspect the scatterplot. The spread of Y should be similar across all X values.
Violation (Heteroscedasticity) Signs: Fan-shaped pattern; the spread of Y widens or narrows as X increases.

Solutions:

South Asian Examples:

• Income data: Often highly skewed (few very wealthy households).
• Agricultural yields: May have different variance across farm sizes.
• Education scores: Floor/ceiling effects common.

ANOVA ASSUMPTIONS

Pre-Analysis Checklist

Assumption 1: Independence of Observations

Critical Assumption - Violations Seriously Affect Results

Solutions for Dependence:

South Asian Context:

• Village clustering: People in the same village share infrastructure, weather, and policies.
• Household clustering: Family members share economic conditions.
• Regional clustering: States/districts have different governance.
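
One way to gauge how much clustering matters is the intraclass correlation (ICC) and the design effect derived from it. The sketch below is an illustration under stated assumptions: equal-sized clusters, the one-way ANOVA estimator of the ICC, and hypothetical consumption values for a handful of villages. It uses only the Python standard library.

```python
from statistics import fmean

def icc_oneway(clusters):
    """ICC from a one-way ANOVA decomposition; assumes equal-sized clusters."""
    k = len(clusters)          # number of clusters (villages)
    m = len(clusters[0])       # observations per cluster
    grand = fmean([v for c in clusters for v in c])
    means = [fmean(c) for c in clusters]
    # Mean square between clusters and mean square within clusters
    msb = m * sum((mu - grand) ** 2 for mu in means) / (k - 1)
    msw = sum((v - mu) ** 2
              for c, mu in zip(clusters, means) for v in c) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

# Hypothetical monthly consumption for 4 households in each of 3 villages
villages = [
    [10, 12, 11, 13],
    [20, 22, 21, 23],
    [30, 32, 31, 33],
]

icc = icc_oneway(villages)
m = len(villages[0])
design_effect = 1 + (m - 1) * icc          # variance inflation from clustering
n_total = sum(len(v) for v in villages)
effective_n = n_total / design_effect      # independent-equivalent sample size
print(f"ICC={icc:.3f}, design effect={design_effect:.2f}, "
      f"effective n={effective_n:.1f} of {n_total}")
```

An ICC near zero means households within a village are no more alike than households in different villages. In this deliberately extreme example almost all variation is between villages, so twelve households carry roughly the information of three independent observations.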

Assumption 2: Normality of Residuals

How to Check:
• Histogram of residuals
• Q-Q plot of residuals
• Shapiro-Wilk test on residuals
• Check each group separately
Violation Signs:
• Skewed residual distribution
• Heavy tails in Q-Q plot
• Significant normality test
• Different shapes across groups

Solutions for Non-normality:

Assumption 3: Homogeneity of Variance (Homoscedasticity)

How to Check:
• Levene's test
• Bartlett's test (sensitive to departures from normality)
• Box plot comparison
• Residuals vs fitted values plot
Violation Signs:
• Significant Levene's test (p < 0.05)
• Very different group variances
• Rule of thumb: ratio of largest to smallest group variance exceeds 4:1
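
The variance-ratio rule of thumb and Levene's test can be sketched as follows. This is a minimal standard-library illustration with hypothetical yield data; the statistic uses deviations from group medians (the Brown-Forsythe variant of Levene's test), and no p-value is computed because the standard library has no F distribution.

```python
from statistics import fmean, median, variance

def variance_ratio(groups):
    """Largest sample variance divided by the smallest."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

def levene_statistic(groups):
    """Levene's W from absolute deviations around each group median."""
    z = []
    for g in groups:
        center = median(g)
        z.append([abs(v - center) for v in g])
    n_total = sum(len(g) for g in z)
    k = len(z)
    z_means = [fmean(g) for g in z]
    z_grand = fmean([v for g in z for v in g])
    between = sum(len(g) * (zm - z_grand) ** 2 for g, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for g, zm in zip(z, z_means) for v in g)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical crop yields (quintals per hectare) for two groups of farms
irrigated = [10, 11, 12, 13, 14]
rainfed = [5, 10, 15, 20, 25]

ratio = variance_ratio([irrigated, rainfed])
w = levene_statistic([irrigated, rainfed])
print(f"variance ratio = {ratio:.1f} (rule of thumb: worry above 4)")
print(f"Levene W = {w:.2f}; compare with an F(k-1, N-k) critical value")
```

W is referred to an F distribution with k-1 and N-k degrees of freedom. For a p-value, `scipy.stats.levene` implements the same test.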

Solutions for Unequal Variances:

South Asian Development Examples:

• Income data: Control groups often have less variance than treatment groups.
• Test scores: Rural vs urban schools may have very different variance.
• Agricultural yields: Irrigated vs rain-fed areas show different variability.

Special Considerations for Development Data

REGRESSION ANALYSIS ASSUMPTIONS

Pre-Analysis Checklist

Assumption 1: Linearity

How to Check:
• Scatterplots of Y vs each X
• Residuals vs fitted values plot
• Component-plus-residual plots
• Added variable plots
Violation Signs:
• Curved patterns in scatterplots
• Systematic patterns in residuals
• Poor model fit despite significance

Solutions for Non-linearity:

Assumption 2: Independence of Residuals

Critical Assumption - Often Violated in Development Data

Solutions for Dependence:

Assumption 3: Homoscedasticity

How to Check:
• Residuals vs fitted values plot
• Breusch-Pagan test
• White test
• Plot residuals vs each predictor
Violation Signs:
• Fan-shaped residual pattern
• Variance increases with fitted values
• Significant heteroscedasticity tests
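
To make the Breusch-Pagan idea concrete, the sketch below implements Koenker's studentized variant for a single predictor: regress the squared residuals on X and use LM = n × R², which is compared with a chi-square distribution with 1 degree of freedom. It is a standard-library illustration on hypothetical, deterministic data, not a replacement for a tested implementation.

```python
import math
from statistics import fmean

def ols(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    intercept, slope = ols(x, y)
    my = fmean(y)
    ss_res = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

def breusch_pagan_koenker(x, y):
    """Return (LM, p-value); valid as written for one predictor only."""
    intercept, slope = ols(x, y)
    sq_resid = [(b - intercept - slope * a) ** 2 for a, b in zip(x, y)]
    # Guard against tiny negative values caused by floating-point rounding
    lm = max(0.0, len(x) * r_squared(x, sq_resid))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value

# Hypothetical data: same trend, but error size either grows with x or is constant
x = list(range(1, 41))
signs = [1 if i % 2 == 0 else -1 for i in range(len(x))]
y_fan = [2 + 3 * a + s * 0.5 * a for a, s in zip(x, signs)]   # fan-shaped errors
y_even = [2 + 3 * a + s * 5 for a, s in zip(x, signs)]        # constant spread

lm_fan, p_fan = breusch_pagan_koenker(x, y_fan)
lm_even, p_even = breusch_pagan_koenker(x, y_even)
print(f"fan-shaped errors: LM={lm_fan:.1f}, p={p_fan:.2g}")
print(f"constant spread:   LM={lm_even:.1f}, p={p_even:.2g}")
```

With several predictors the auxiliary regression includes all of them and the degrees of freedom equal the number of predictors; `statsmodels.stats.diagnostic.het_breuschpagan` handles that general case.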

Solutions for Heteroscedasticity:

Assumption 4: Normality of Residuals

How to Check: Histogram of residuals, Q-Q plot, Shapiro-Wilk test (on residuals, not original data)

Solutions for Non-normal Residuals:

Assumption 5: No Multicollinearity

How to Check:
• Correlation matrix of predictors
• Variance Inflation Factor (VIF)
• Condition indices
• Tolerance values
Violation Signs:
• High correlations (|r| > 0.8)
• VIF > 5 (some sources use a more lenient cutoff of 10)
• Tolerance < 0.2 (equivalent to VIF > 5, since tolerance = 1/VIF)
• Unstable coefficients
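
The link between correlation, tolerance, and VIF is easiest to see with two predictors, where VIF = 1 / (1 - r²) for both. The sketch below uses that special case with hypothetical education and income values and only the Python standard library; with three or more predictors each VIF requires regressing one predictor on all the others, which this shortcut does not do.

```python
import math
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

def vif_two_predictors(x1, x2):
    """Return (r, tolerance, VIF) for a model with exactly two predictors."""
    r = pearson_r(x1, x2)
    tolerance = 1 - r ** 2      # share of variance not explained by the other predictor
    return r, tolerance, 1 / tolerance

# Hypothetical respondents: years of schooling and household income (thousands)
education_years = [0, 2, 4, 5, 8, 10, 12, 12, 15, 16]
income = [1, 5, 8, 11, 15, 21, 24, 25, 29, 33]

r, tolerance, vif = vif_two_predictors(education_years, income)
print(f"r={r:.3f}, tolerance={tolerance:.4f}, VIF={vif:.0f}")
print("Multicollinearity flagged:", abs(r) > 0.8 or vif > 5)
```

For models with more predictors, `statsmodels.stats.outliers_influence.variance_inflation_factor` computes the VIF of each column of the design matrix.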

Solutions for Multicollinearity:

Common Multicollinearity in South Asian Development:

• Education & Income: Highly correlated.
• Infrastructure variables: Water, electricity, roads often bundled.
• Health indicators: Multiple nutrition measures.
• Geographic variables: Rainfall, temperature, elevation may be collinear.

Assumption 6: No Influential Outliers

How to Check:
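• Cook's distance (> 1 is problematic)
• Leverage values (> 2k/n, with k usually taken as the number of estimated coefficients and n the sample size)
• Studentized residuals (absolute value > 3)
• DFBETAS (absolute value > 2/√n)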
Outlier Types:
• High leverage (unusual X values)
• High residual (unusual Y values)
• High influence (affects coefficients)
• May be genuine or errors
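
The diagnostics above can be computed by hand for a simple regression. The sketch below is a standard-library illustration on hypothetical data with one deliberately unusual point; it assumes a single predictor, reads k in the leverage cutoff as the number of estimated coefficients (intercept plus slope, so 2 here), and uses internally studentized residuals.

```python
import math
from statistics import fmean

def influence_measures(x, y):
    """Leverage, studentized residuals, and Cook's distance for one predictor."""
    n, p = len(x), 2   # p = estimated coefficients: intercept and slope
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)          # residual variance
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    student = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, leverage)]
    cooks = [r ** 2 * h / (p * (1 - h)) for r, h in zip(student, leverage)]
    return leverage, student, cooks

# Hypothetical data: the last point has an unusual X and falls far off the trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 15.0]

leverage, student, cooks = influence_measures(x, y)
n, p = len(x), 2
high_leverage = [i for i, h in enumerate(leverage) if h > 2 * p / n]
large_residual = [i for i, r in enumerate(student) if abs(r) > 3]
influential = [i for i, d in enumerate(cooks) if d > 1]
print("High leverage:", high_leverage)
print("Large studentized residual:", large_residual)
print("Cook's distance > 1:", influential)
```

Note that the unusual point pulls the fitted line toward itself, so its own residual stays below the ±3 cutoff even though its leverage and Cook's distance are high. This is why residual size alone can miss influential observations.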

Handling Outliers:

Outliers in Development Data:

• Success stories: Exceptionally successful interventions.
• Extreme poverty: Households with very low income/assets.
• Urban-rural differences: Urban areas in rural samples.
• Data errors: Recording mistakes, unit confusion.

Assumption Violation Severity Guide

Assumption        | Method                  | Severity if Violated  | Primary Consequence
Independence      | All methods             | CRITICAL              | Invalid p-values, wrong conclusions
Linearity         | Correlation, Regression | HIGH                  | Missed relationships, poor predictions
Normality         | All methods             | MEDIUM (with large n) | Slightly inaccurate p-values
Homoscedasticity  | ANOVA, Regression       | MEDIUM                | Inefficient estimates, wrong SE
Multicollinearity | Regression              | MEDIUM                | Unstable coefficients, interpretation issues

South Asian Development Data: Common Issues & Solutions

Typical Data Challenges

Challenge             | Description                            | Statistical Impact                   | Recommended Solution
Seasonal effects      | Agricultural data varies by monsoon    | Non-independence, heteroscedasticity | Include season controls, cluster by year
Village clustering    | Households in same village are similar | Independence violation               | Cluster-robust SE, multilevel models
Extreme inequality    | Very skewed income distributions       | Non-normality, outliers              | Log transformation, robust methods
Missing data patterns | Non-random missingness                 | Selection bias                       | Multiple imputation, selection models
Floor/ceiling effects | Many zero values or maximum scores     | Non-normality, non-linearity         | Tobit models, transformations

Practical Recommendations

Remember: Perfect Data Doesn't Exist

The goal is not to meet every assumption perfectly, but to: